Home Depot Product Search Relevance

The challenge is to predict a relevance score for each provided combination of search term and product. To create the ground-truth labels, Home Depot crowdsourced the search/product pairs to multiple human raters, who scored each pair on a scale from 1 (not relevant) to 3 (highly relevant).

GraphLab Create

This notebook uses the GraphLab Create machine learning module for IPython. You need a personal license to run this code.
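
For completeness, a minimal sketch of registering that license key; the gl.product_key.set_product_key call is an assumption based on GraphLab Create's setup instructions of the time, so verify it against your installed version:

import graphlab as gl
# assumption: GraphLab Create 1.8.x exposed this one-time key registration call
gl.product_key.set_product_key('YOUR-PRODUCT-KEY')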


In [1]:
import graphlab as gl

Load data from CSV files


In [2]:
train = gl.SFrame.read_csv("../data/train.csv")


[INFO] GraphLab Create v1.8.3 started. Logging: /tmp/graphlab_server_1456701323.log
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
Parsing completed. Parsed 74067 lines in 0.179878 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
test = gl.SFrame.read_csv("../data/test.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
Parsing completed. Parsed 100 lines in 0.213001 secs.
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
Parsing completed. Parsed 166693 lines in 0.330936 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
Parsing completed. Parsed 100 lines in 0.531025 secs.
Read 61134 lines. Lines per second: 58160.7
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
Parsing completed. Parsed 124428 lines in 1.65722 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [5]:
attr = gl.SFrame.read_csv("../data/attributes.csv")


Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/attributes.csv
Parsing completed. Parsed 100 lines in 0.739144 secs.
Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/attributes.csv
Parsing completed. Parsed 2044803 lines in 1.69493 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

Data merging and feature engineering


In [6]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [7]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')

In [8]:
# attributes whose value is "No" mean the product lacks that feature, so drop them
print len(attr)
attr = attr[attr['value'] != "No"]
print len(attr)


2044803
1952634

In [9]:
# if an attribute's value is "Yes", copy the attribute name into the value so the attribute text can be searched
attr['value'] = attr.apply(lambda x: x['name'] if x['value'] == "Yes" else x['value'])
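
As a minimal illustration of what this transform does on a single row (plain Python; the attribute name here is hypothetical):

# illustrative only: the "Yes" normalisation applied to one hypothetical row
row = {'name': 'ENERGY STAR Certified', 'value': 'Yes'}
value = row['name'] if row['value'] == 'Yes' else row['value']
print value  # -> 'ENERGY STAR Certified', making the attribute text searchable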

Let's select brands


In [10]:
brands = attr[attr['name'] == "MFG Brand Name"]

In [11]:
brands.head()


Out[11]:
product_uid name value
100001 MFG Brand Name Simpson Strong-Tie
100002 MFG Brand Name BEHR Premium Textured DeckOver ...
100003 MFG Brand Name STERLING
100004 MFG Brand Name Grape Solar
100005 MFG Brand Name Delta
100006 MFG Brand Name Whirlpool
100007 MFG Brand Name Lithonia Lighting
100008 MFG Brand Name Teks
100009 MFG Brand Name House of Fara
100010 MFG Brand Name Valley View Industries
[10 rows x 3 columns]

Bullets too


In [12]:
bullets = attr[attr['name'].contains("Bullet")]

In [13]:
# converting bullets to columns: unstack collapses each product's (name, value)
# pairs into a single dict, and unpack expands that dict into one column per bullet
bullets = bullets.unstack(column = ['name', 'value'], new_column_name = "bullets")
bullets = bullets.unpack("bullets")
bullets = bullets.sort("product_uid")
print len(bullets)


86263
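
To see what unstack and unpack do here, a toy sketch with made-up bullet values (same GraphLab calls as above):

# toy example: two bullet attributes for a single product
toy = gl.SFrame({'product_uid': [100001, 100001],
                 'name': ['Bullet01', 'Bullet02'],
                 'value': ['stainless steel', 'rust resistant']})
stacked = toy.unstack(column = ['name', 'value'], new_column_name = "bullets")
# -> one row: product_uid = 100001, bullets = {'Bullet01': ..., 'Bullet02': ...}
print stacked.unpack("bullets")
# -> columns bullets.Bullet01 and bullets.Bullet02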

In [14]:
# merge train with brands and bullets
train = train.join(brands, on = 'product_uid', how = 'left')
train = train.join(bullets, on = 'product_uid', how = 'left')

In [15]:
# merge test with brands and bullets
test = test.join(brands, on = 'product_uid', how = 'left')
test = test.join(bullets, on = 'product_uid', how = 'left')

TF-IDF with linear regression


In [16]:
def calculateTfIdf(cols, data, searchColTfIdfName):
    # for each text column: bag-of-words counts, TF-IDF weights, and the cosine
    # distance between the column's TF-IDF vector and the search term's
    for colName in cols:
        newColNameWordCount = colName + "_word_count"
        newColNameTfIdf = colName + "_tfidf"
        newColDistance = colName + "_distance"

        data[newColNameWordCount] = gl.text_analytics.count_words(data[colName])
        data[newColNameTfIdf] = gl.text_analytics.tf_idf(data[newColNameWordCount])

        # note: this guard compares the raw column name against the TF-IDF column
        # name ('search_term_tfidf'), so it is always true and a search_term_distance
        # column (the search vector's distance to itself, identically 0) is created too
        if searchColTfIdfName != colName:
            data[newColDistance] = data.apply(
                lambda x: 0 if x[newColNameTfIdf] is None
                          else gl.distances.cosine(x[searchColTfIdfName], x[newColNameTfIdf]))

    return data
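
A quick sanity check of the distance used above, on hand-built toy vectors (this assumes gl.distances.cosine accepts dict-typed sparse vectors, as in the apply above):

a = {'deck': 0.9, 'paint': 0.4}
b = {'deck': 0.9, 'stain': 0.7}
print gl.distances.cosine(a, a)  # identical vectors -> 0.0
print gl.distances.cosine(a, b)  # partial overlap -> strictly between 0 and 1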

In [17]:
# columns = ['search_term', 'product_title', 'product_description', 'value', 'bullets.Bullet01',
#          'bullets.Bullet02', 'bullets.Bullet03', 'bullets.Bullet04', 'bullets.Bullet05', 'bullets.Bullet06'
#          , 'bullets.Bullet07', 'bullets.Bullet08', 'bullets.Bullet09', 'bullets.Bullet10', 'bullets.Bullet11'
#          , 'bullets.Bullet12', 'bullets.Bullet13', 'bullets.Bullet14', 'bullets.Bullet15', 'bullets.Bullet16'
#          , 'bullets.Bullet17', 'bullets.Bullet18', 'bullets.Bullet19', 'bullets.Bullet20', 'bullets.Bullet21'
#          , 'bullets.Bullet22']

columns = ['search_term', 'product_title', 'product_description', 'value']

train = calculateTfIdf(columns, train, 'search_term_tfidf')

In [18]:
test = calculateTfIdf(columns, test, 'search_term_tfidf')

In [19]:
featuresDistance = [s for s in train.column_names() if "distance" in s]
print featuresDistance


['search_term_distance', 'product_title_distance', 'product_description_distance', 'value_distance']

In [20]:
#train = train.dropna('value_distance')

In [21]:
model1 = gl.linear_regression.create(train, target = 'relevance', features = featuresDistance)


Linear regression:
--------------------------------------------------------
Number of examples          : 70282
Number of features          : 4
Number of unpacked features : 4
Number of coefficients    : 5
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 1.033884     | 1.933607           | 1.718643             | 0.507793      | 0.501934        |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
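
As the log suggests, the automatic 5 percent validation split can be disabled to train on all 74067 rows; a sketch (model_full is a hypothetical name):

# optional: train on the full training set, without the validation split
model_full = gl.linear_regression.create(train, target = 'relevance',
                                         features = featuresDistance,
                                         validation_set = None)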



In [22]:
# let's take a look at the learned weights
model1.get("coefficients")


Out[22]:
name                          index  value             stderr
(intercept)                   None   3.34869888667     0.0151876483122
search_term_distance          None   -0.0174618514756  1.47829769014e+13
product_title_distance        None   -0.595348024821   0.0128923155948
product_description_distance  None   -0.483697113003   0.0199678645082
value_distance                None   -0.127583244653   0.00464255002758
[5 rows x 4 columns]
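
Note the enormous standard error on search_term_distance: as flagged in calculateTfIdf, that column is identically zero, so it carries no information. A sketch of retraining without it (nonZeroFeatures and model2 are hypothetical names):

# drop the degenerate all-zero feature and retrain
nonZeroFeatures = [f for f in featuresDistance if f != 'search_term_distance']
model2 = gl.linear_regression.create(train, target = 'relevance',
                                     features = nonZeroFeatures)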


In [23]:
# disabled: the test set has no 'relevance' column, so RSS cannot be computed on it
'''
predictions_test = model1.predict(test)
test_errors = predictions_test - test['relevance']
RSS_test = sum(test_errors * test_errors)
print RSS_test
'''


Out[23]:
"\npredictions_test = model1.predict(test)\ntest_errors = predictions_test - test['relevance']\nRSS_test = sum(test_errors * test_errors)\nprint RSS_test\n"

In [24]:
predictions_test = model1.predict(test)
predictions_test


Out[24]:
dtype: float
Rows: 166693
[2.163608582030827, 2.1420705041883856, 2.3702256434364006, 2.375329695996665, 2.3321203991937733, 2.1420705041883856, 2.3519518753144855, 2.359421253304835, 2.178561192643488, 2.666910931658947, 2.507026409672497, 2.412125970203885, 2.5170007078538994, 2.3258349990880607, 2.2242032327035375, 2.3941878695971086, 2.1420705041883856, 2.6140855624161623, 2.1699813789661144, 2.2065949192062444, 2.717407736231517, 2.712865824093638, 2.1779403904932306, 2.337502788278624, 2.1420705041883856, 2.2617218748848984, 2.1420705041883856, 2.3199329871957146, 2.2344124134245718, 2.385991303459093, 2.5326287324153878, 2.4292165976367617, 2.2066803567199864, 2.335888999500826, 2.19339399150458, 2.1420705041883856, 2.2400855057732123, 2.167872304387674, 2.686237771842258, 2.5971778998122836, 2.1632514252912123, 2.3006226583130776, 2.2100150660265516, 2.1420705041883856, 2.4489243449769083, 2.3715994152817577, 2.423937656116298, 2.4740470887648307, 2.6614303865830693, 2.2622184754650583, 2.1495636190535095, 2.1420705041883856, 2.170407029327005, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.14249141903142, 2.1420705041883856, 2.1420705041883856, 2.174982070139289, 2.1752787927457464, 2.1420705041883856, 2.2784681803740305, 2.482949286137385, 2.2329231441394812, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.2786148712217598, 2.1420705041883856, 2.617211637992733, 2.4196822993012703, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.1420705041883856, 2.470706963362782, 2.2446674927471233, 2.325098349640318, 2.2099398367301855, 2.212277355188578, 2.3397732932525357, 2.4953560472968697, 2.5118006031281985, 2.442928045771276, 2.536508440492266, 2.3429026352796365, 2.1420705041883856, 2.610918338740749, 2.3293265197071764, 2.267742429379897, 2.244849860123413, 2.422150849969186, 2.3325964347766477, 2.341568501633621, ... ]

In [25]:
# build the submission from the pair ids
submission = gl.SFrame(test['id'])

In [26]:
# add_column appends the predictions as X2; rename the generated X1/X2 columns
submission.add_column(predictions_test)
submission.rename({'X1': 'id', 'X2': 'relevance'})


Out[26]:
id relevance
1 2.16360858203
4 2.14207050419
5 2.37022564344
6 2.375329696
7 2.33212039919
8 2.14207050419
10 2.35195187531
11 2.3594212533
12 2.17856119264
13 2.66691093166
[166693 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [27]:
# clip predictions into the valid relevance range [1.0, 3.0]
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
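
The same clipping can be done in a single call with SArray.clip, assuming that method is available in this GraphLab version:

# equivalent one-liner: clip the predictions into [1.0, 3.0]
submission['relevance'] = submission['relevance'].clip(1.0, 3.0)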

In [28]:
# format relevance as a string for the CSV export
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [29]:
# quote_level = 3 corresponds to csv.QUOTE_NONE, i.e. no quoting in the output file
submission.export_csv('../data/submission.csv', quote_level = 3)

In [ ]:
#gl.canvas.set_target('ipynb')